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Abstract 



This paper introduces a new Monte Carlo algorithm to invert large matrices. It is based on 
simultaneous coupled draws from two random vectors whose covariance is the required inverse. 
It can be considered a generalization of a previously reported algorithm for hermitian matrices 
inversion based in only one draw. The use of two draws allows the inversion on non-hermitian 
matrices. Both the conditions for convergence and the rate of convergence are similar to the 
Gauss-Seidel algorithm. Results on two examples are presented, a real non-symmetric matrix 
related to quantitative genetics and a complex non-hermitian matrix relevant for physicists. 
Compared with other Monte Carlo algorithms it reveals a large reduction of the processing 
time showing eight times faster processing in the examples studied. 
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1 Introduction 

The computation of the inverse of a matrix or some of its elements is one of the main topics in 
numerical analysis. Large sparse matrices can be usually inverted from its factors obtained by 
using sparse matrix techniques Although it can be alleviated by reordering rows and columns, 
these techniques suffer from the fill-in required by the LU factorization and they usually result in 
algorithms demanding large random access memory (RAM) values. 

Monte Carlo algorithms are used to estimate empirically the expectation of random variables. 
Using them to invert matrices consist basically in obtaining and averaging several realized values 
of a variable whose expectation is the required inverse. Although Monte Carlo algorithms present 
a stochastic error, they can be used to obtain estimates of the inverse when the matrix is too large 
or complicate to be inverted by using conventional or sparse algorithms. These algorithms have 
high parallel efficiency, they do not demand a great amount of RAM and the processing time is 
proportional to the number of nonzero elements of the matrix to be inverted. In general, they can 
be very efficient when rough estimates of the inverse matrix are required. 

The most common Monte Carlo algorithms to invert matrices are based on finite discrete 
Markov chains 5 . In these algorithms several random trajectories are calculated to solve a linear 
system of equations for each matrix column. Alternatively, there are Monte Carlo methods based 
on random variables whose expectation is the whole inverse instead one of its columns. Typically 
the rational is to consider the sought for inverse matrix C _1 as the covariance matrix of a normally 
distributed random vector, 



After generating by any Monte Carlo algorithm sampled values of z (something for which C _1 
is not needed but just C), the inverse matrix is obtained by an estimation of the variances and 
covariances of the sample. 
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Recently the Gibbs sampler algorithm has been proposed to draw z values from 
using the elements of C |SJ E| ■ The procedure consists in updating successively the elements of z 
for several cycles by using the formula 

^=^^-^t^o ij -^-t^ Cij (2) 

v 3=1 

where the subscript i means the i th element of z, the superscript (k) means the k th cycle and <j> is 
an independent random noise drawn from a standard normal distribution. After a given period of 
convergence usually called the burn-in period, will satisfy at each cycle E fz^z'") = C _1 

or E ^zt('s)Q z (' s )^ = trQCP 1 (the f denotes hermitian conjugate, i.e., transposition and complex 

conjugation). These expectations will provide the elements of the inverse or linear functions of 
them respectively. Hereafter, we will denote this algorithm by GS. Although GS does not provide 
an independent but a serially correlated set of samples, the convergence is fast and, contrary to 
Metropolis-like algorithms, it profits of all the draws. 

The obvious disadvantage of the algorithms based in Eq. is a rather narrow range of 
applicability because of the necessary condition of C being positive definite. For many statistical 
applications this is not a problem ^1] but it is a serious drawback in Physics applications (for 
instance in Lattice QCD |17|1. The problem can be bypassed by inverting C^C and applying 
the result to C but this procedure seriously reduces the efficiency Rather surprisingly this is not 
the case for the GS algorithm. Once recourse is made to the Gibbs sampler to implement the 
covariance matrix rational, the convergence of the algorithm, at least if it is explicitly written as 
in Eq. J5J), is governed by the sampling method and not by the implications of (QQ) . As we will 
see, the convergence conditions are then more flexible but stillCmust be symmetric (or hermitian 
if defined over the complex numbers). 

A different approach usually known as stochastic estimation (SE from now on), avoid this 
problem J^j by recourse to the repeated solution of linear systems like in the discrete Markov 
Chain methods mentioned above but this time the number of systems to be solved does not 
depend on the rank of the matrix to be inverted. The rational is to solve a series of linear systems 
Cv = (f> where (f> is drawn from a random variable satisfying E ((f)) — and E (</><^) = I. The 
inverse is then given by taking the ensemble average E (v0^) = C~ 1 . Rather surprisingly it was 
not recognized until recently that this method do not rely on the gaussianity of (j). Thus, Dong 
and Liu 6 and, independently, Garcfa-Cortes have shown that the efficiency of the method 
improves substantially when is drawn as <f>i = 2B — 1, where B is a Bernoulli random variable (a 
Z 2 noise). 

When applicable, however, it is expected that GS will be more efficient than SE. This comes 
from the fact that GS mimics Gauss-Seidel algorithm but including noise at each cycle ^Hj- The 
processing time to obtain each realized value of zz^ or z^Qz is then slightly greater than the 
processing time of a Gauss-Seidel iteration, while the processing time to obtain a SE realized 
value is equivalent to the processing time needed to solve a linear system. 

In this paper we introduce a new "noisy" Gauss-Seidel method which include as a particular 
case the GS algorithm. Our algorithm, that we will call it the correlated chains sampling algorithm 
(CC from now on), is applicable to nondefinite positive matrices and profits of the larger efficiency 
of the Z 2 noise. The power of the algorithm is assessed with two different non-hermitian matrix, 
a real one used in genetic improvement in animal breeding and a complex one typical of particle 
Physics. Numerical tests show how our proposal can be around an order of magnitude faster than 
SE. The sufficient and necessary conditions for the convergence of the method are also given. 

2 The correlated chains sampling algorithm 

The CC algorithm introduced in this section is based on the simultaneous update of a couple of 
random vectors z and w such as, after convergence of the algorithm, E (zw^) = C _1 being C the 
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matrix to be inverted. The algorithm updates simultaneously each element of z and w using 
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where is a set of independent noise vectors such as 

E = (5) 

and 

W$(*)$(0tj =iS kl . (6) 

Note that the noise term in (J3J) and in (0J are both the same. 

After discarding N cycles of a total of M as a period of convergence or burn-in, the Monte 
Carlo estimation of the inverse can be obtained from, 

M-N ^ K ' 

k=N+l 

Expression E (z^Qz) = irQC -1 also holds, and its Monte Carlo estimation via the CC algo- 
rithm is given by 

1 M 

M-N ^ w 

fc=JV+l 

2.1 Demonstration 

Let us start by rewriting Eqs. and Q in matrix form, 

Z W = JL $(*) _ I Lz « - luz^" 1 ) , (9) 

Vd d d w 



w 



_ -UWW - — rLtw^" 1 ) , (10) 
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where L, D and U are the lower triangle, the diagonal and the upper triangle of C respectively, 
so that L + D + U = C. Solving for z( k > and w^ ' 1 * from Eqs.© and lfTU|l we have, 

wW ,^ mtV5 _l__ w(l -, tL _l_. (12) 
After running the algorithm during n updating cycles lfTT|) and (fT2jl leads to, 

n 

z (n) _ ^(-T) fc 8 ( ™- fc) , (13) 

k=0 
n 

,(n)t _ V"r(»-fe)t (-S) fc , (14) 
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where for the ease of the notation we have defined, 
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for all fc except for k = for which, again for the ease of notation, 9^°^ and denote just 
the initial vectors z^ and w^ ^ respectively. For a bounded noise such us the Z 2 noise, z^ n > and 
w( n )are bounded by 
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where B is the upper bound of the noise $ and therefore the absolute convergence of the alternating 
power series T and S warrants the convergence of the algorithm. In fact, this is a necessary and 
sufficient condition for the convergence of the algorithm for any possible value of the noise chain. 
To see this notice that the worst case corresponds to a noise such as the components of T fe and S fc 
with the maximum absolute values attains positive signs. But the maximum of the absolute values 
of the components of a matrix is a norm. The convergence of these components, which warrants 
the convergence of any other of the components, corresponds therefore to the absolute convergence 
of the series under such a norm and all the norms are equivalent as far as the convergence of matrix 
series is concerned. 

Given the convergence of the alternating power series in T and S we have finite z 1 - 00 ) and w^ 00 - 1 
and for a given integer N, the z^ and w 1 ™) corresponding to n > N can be split in series of N 
terms plus and arbitrarily small remainder for a sufficiently large N, 
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The sample average of the zw' product discarding N burn-in cycles of a total of M ^S> N, again 
retaining explicitly only terms up to order N, gives 
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Taking the limit of large M, Eq. {03) yields, 
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In the limit of large N, Eq. (|2*U|> reduces to 
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Equation l|21|l represents the stationary value of the zw' average. That the series in 1(2111 
approach C^ 1 , when they converge, is easily verified by iteration of the following recursive formula, 

^DTL D DTu +T i S < W 

which can be derived from the identity, 

(C-U)i(C-L)=C-(U + L)+U^L, 
since for the particular case in which C = D + U + Lit reduces to, 

(D + L)i(D + U)-D + uiL. 
Making use of the definitions 115(1 and H16jl formula 122(1 follows trivially. 

2.2 Convergence analysis 

As shown in the previous section, the convergence of the algorithm is determined by the absolute 
convergence of the series, 



£(-T) fc , ( 23 ) 

k=0 
oo 

£(-S) fc . (24) 



k=Q 



We therefore center in searching for the necessary and sufficient conditions for the convergence of 
the series 1(23(1 and 1(241) . Restricting the analysis to the series P3)l as the same rational applies to 
l|24jl. first notice that the absolute convergence of 1(23(1 implies that the spectral radius of T, i.e., 

linifc^oo ||T fc || 1/fc = sp(T) EI, is strictly below 1. This comes as a consequence of the Cauchy root 
convergence test pQ which implies 1 1 1 1 1 < l for some sufficiently large m if ((23(1 converges. 
Conversely if sp(T) < 1, then HT" 1 )! 1 ^™ < 1 for some large enough m which in turn implies 
||T' m || < 1. We now split the series of the absolute values of ((23(1 in a finite part with the terms 
up to m — 1 and the rest, so that, 

oc m— 1 oo m— 1 oo 

^ ^ 1 1 1 1 ^ ^ 1 1 1 1 _|_ ^ ^ 1 1 rpm+fc 1 1 ^ ^ ^ 1 1 rj-ifc 1 1 _|_ ^ ^ 1 1 rjnm 1 1 k 

k=0 k=0 k=0 k=0 k=0 
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Given that ||T™ 1 1 < lthe last term converges to 1/(1 — ||T|| ) and therefore (|23|l converges absolutely. 
Applying the same rational to l|24[) we have then, that the algorithm converges in the sense that 
both z^ 00 ' and w^°°^ are finite for any possible drawn values of the noise <I>, if and only if sp(T) < 1 
and sp(S) < 1 . In strict sense, these are the necessary and sufficient conditions for the convergence 
of Eqs. (flip) and i|12|) . The final algorithm is implemented using © and (|10|l and therefore also 
the non-singularity of Dis needed. 

2.3 The burn-in period and Monte Carlo error 

As usual in this kind of algorithms, the efficiency of the code is increased by discarding a number of 
iterations, N, called burn-in period, which are too influenced by the initial values. Given that the 
CC path does not converge to a deterministic value but to a random variable an estimation of ./V 
is not as easy as in a deterministic iterative procedures (for instance, the Gauss-Seidel algorithm). 
To overcome the difficulties associated to the random nature of the algorithm we will use the so 
called coupling method |lf>| . It consists in running a couple of paths for both z and with 
different initial values but with the same random numbers for each couple. When the difference 
between paths of a couple reach a given tolerance, the burn-in is assumed to be finished. From 
© the difference between two paths Zi and Z2 with different initial values but with the same 
results in, 

z« - ,[« = -1l (4 k} - 4 k) ) - (4 k - 1] - 4 k ~ 1] ) ■ (25) 

But Eq. i|25[l is nothing else that the formula corresponding to a Gauss-Seidel iterative algorithm 
20 for the solution of the linear system, 

C(z 2 -zi)=0, 

whose solution is z 2 — zi = 0, for a non-singular C. Not surprisingly, the condition for the 
convergence of the Gauss-Seidel algorithm represented by (|25)l is, sp((D + L) _1 U) < 1 , that is, 
sp(T) < I. As expected, the same rational applied to Eqs. (|lf)|> leads to a Gauss-Seidel algorithm 
for C^(w2 — wi) — with a convergence condition sp((D^ + U^) L T ) < 1, that is, sp(S) < 1 
. These two Gauss-Seidel procedures can be useful to analyze the convergence rate of the CC 
algorithm in a given instance. 

It is also relevant to calculate the Monte Carlo error of the expectations Q) . The Monte Carlo 
error is Gaussian because of the central limit theorem, but the realized values provided by the 
CC algorithm are serially correlated and its Monte Carlo variance is bigger than the Monte Carlo 
variance expected from an independent set of M — N samples. To calculate the Monte Carlo error 
instead of M — N samples, we used a higher effective number of samples as proposed in |12| . 

2.4 Algorithm 

Algorithm 1 shows the correlated chains algorithm including the determination of the burn-in 
period (N). During the burn-in period, four chains are computed until the paths converge to a 
given tolerance, typically a number small enough to avoid the dependency of the chains on the 
starting values. In order to simplify the algorithm outlined here, we consider a fixed chain length 
(Af), i.e., we are assuming that the M — N cycles after burn-in are enough to obtain accurate 
estimates of the elements of the inverse. Nevertheless, in the numerical tests presented in this 
paper, we have included the determination of M — N by inserting the calculation of the effective 
number of cycles during the iteration process. 

Algorithm 1 Correlated chains algorithm including the determination of the burn- 
in PERIOD 

Given n, Q, C, B, tol 

1. Set arbitrary starting values for z, z*, w' and w'* ; for instance Zi—0, z*=i, Wi=0 and 
w*=i. 



(:> 



2. Sample 4> as a vector containing independent draws according with definition |3|) and 

3. Update z, z* , w' and w'* by using equations Q) and ^j) 
^. p <- p+ 1 

5. Go to step 2 until z'z* < tol and w'w* < tol 

6. Sample (j) as a vector containing independent draws according with definition |5|) and 0) 

7. Update z and w 6j/ using equations and O) 
<§. Accumulate z'Qw m s 

9. Go to step 6 to compute the next round of iteration (B — p times) 

10. Set the final estimate: irQC -1 s/ (B — p) 

3 Numerical tests 

To check the computational efficiency of the proposed algorithm we will compare its performance 
with respect to the efficiency of the SE algorithm which has shown itself as a efficient inver- 
sion method We will make use of two different examples from two disparate fields: genetic 
improvement in animal breeding and quantum field theory in the lattice. 

3.1 Example 1. Wu and Schaeffer's real asymmetric matrix 

The Henderson's mixed model equations JH] are routinely used in animal breeding to evaluate 
the candidates to the artificial selection in livestock populations. This method takes into account 
both the performance records of the animals in a given population and the pedigree relationships 
between animals. In animal breeding, the quality of the estimations provided by a given model 
corresponds to the diagonal elements of the inverse of a coefficient matrix. For instance, in the 
case of the simplest model, these diagonal elements, known as the prediction error variances, take 
the form 

\( xtx xt 

dmg { X I + A-^ 

where X is an incidence matrix mapping animals into herds. Its nonzero elements are Xij = 1, 
which indicates that animal i was recorded in herd j. I is the identity matrix with order equal 
to the number of animals in the population. Variances a\ and o~\ are assumed to be known and 
correspond to the additive genetic variance and the residual variance. A is known as the numerator 
relationship matrix and it maps the genetic relationships between animals. For instance, ay = 0.25 
if animals i and j are half-sibs, ay = 0.5 if animals i and j are parent-progeny related, an = 1 
if the animal i is not inbred, etc. The inverse of the numerator relationship matrix A -1 can be 
easily obtained by using the Henderson's rules |15| . but the whole coefficient matrix in equation 
(|26J) has to be explicitly inverted. 

The amount of animals used in a typical analysis is around hundreds of thousands and at least 
a matrix row per animal is needed so that the rank of the coefficient matrix in Eq. 1|26[1 is of that 
order. This coefficient matrix is positive definite and the Gibbs sampler based algorithm |14j can 
be easily implemented. Nevertheless, the artificial selection based on the Henderson's mixed model 
equations tend to select animals coming from a small number of families and the percentage of 
inbreeding increases significantly after a few generations of artificial selection. Wu and Schaeffer 
[T§| proposed an original method to select artificially the populations keeping the genetic gain 
very close to the Henderson's optimum but reducing significantly the rate of inbreeding. The 
numerator relationship matrix has to be replaced by a customary asymmetric relationship matrix 
A , and they provide simple rules to obtain A -1 . For each animal i in the pedigree, with sire s 
and dam d, 

1. Add ((1 — A) 8i + A) to the (i, i) position of A -1 , 

2. Add — (1 — A) Si/2 to the (i, s) and (i, d) positions of A -1 , 
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Tabic 1: Topics of the Wu and Schacffcr's coefficient matrix in two cases of different size. 



Case 


1 


2 


Number of animals 


50000 


fOOOOO 


Number of herds 


5000 


fOOOO 


Rank of the coefficient matrix 


55000 


ffOOOO 


Nonzero elements 


424978 


848982 




3.0 


3.0 


A 


0.2 


0.2 



3. Add —Si/2 to the (s,i) and (d,i) positions of A -1 , 

4. Add —Si/4 to the (s, s), (d,d), (s,d) and (d,s) positions of A -1 , 

where Si — 2 when both parents of the i th animal are known, Si =4/3 when one parent is known 
and St — 1 when both parents are unknown. A G [0, f] is a known coefficient which determine 
the weight of the family records in each animal genetic evaluation, for instance, setting A = 
correspond to the conventional Henderson's mixed model equations. These rules result in an 
asymmetric coefficient matrix for A ^ 0, non suitable for GS. Both SE and CC can be used to 
compute the inverse required in expression (|26|l even in cases where A is replaced by its non- 
symmetric counterpart A. We will now compare the performance of our algorithm against the SE 
estimator using two cases of Wu and Schaeffer's coefficient matrices as described in Tabled 

The correlated chains sampling was implemented by setting the tolerance for burn-in as 5 10 -5 . 
After the burn- in period was finished, realized values of ir(ztQw) were obtained for each cycle 
and averaged. Convergence after burn-in was checked every 100 th iteration. The standard error 
of the final estimate of E [tr(zTQw)] was obtained from the variance between realized values of 
tr(zTQw) and an effective chain length as described in 12 . Making recourse of this effective length 
comes as a consequence of the correlated nature of the realized values in the CC algorithm. At 
any rate, we tested the Geyer's method against an empirical standard error estimate obtained by 
replicating one hundred times the whole analysis varying the random seed number. The algorithm 
was assumed to reach the convergence when the standard error yielded a relative error smaller 
than the required tolerance of 5 10~ 5 . 

SE method was implemented by solving the linear systems by both the iterative Gauss-Seidel 
algorithm and the bi-conjugate gradient method. The latter resulted in a more efficient SE algo- 
rithm in terms of CPU time and it is the only one presented in this manuscript. 

Tabic |21 shows the results of both algorithms in cases 1 and 2. Relevant topics concerning 
the computing efficiency of the CC algorithm are the number of rounds computed to reach the 
burn-in tolerance, the number of rounds required to reach the required standard error of the final 
estimate, and the CPU time required to compute both each burn-in cycle (four chains) and each 
after burn- in cycle (two chains). Relevant topics for the SE algorithm are the average number 
of rounds required to solve each linear system via the bi-conjugate gradient method (maximum 
change between successive iterations was set to 5 10 -5 ), the number of linear systems to be solved 
in order to reach the required standard error of the final estimate and the CPU time per bi- 
conjugate gradient round of iteration. All CPU times in Table are referred to the time taken to 
complete a cycle in the CC algorithm in the smallest case. 

Table |21 shows also the agreement between the results provided by both the CC and the SE 
algorithms. The CC algorithm is vastly more efficient in terms of the final CPU time required for 
the computation of the inverse being around eight times faster then its SE counterpart. 

3.2 Example 2. The Dirac's free fermion complex non-hermitian matrix 

We now proceed further in assessing the performance of the CC algorithm by including complex 
matrix components as well as by strengthen the number of non-zero elements (in the millions 
range). The tolerance now set to fO -5 with respect to the absolute value rather than for real and 
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Tabic 2: Results of GC and SE on the two Wu and Schacffcr's matrices 



CC method 


SE method 




Case 1 


Case 2 




Case 1 


Case 2 


N 


103 


104 


Rounds per system 


34.63 


38.04 


M-N 


39200 


23800 


Total rounds 


285732 


202448 


Eff. length size 


15788 


8163 


Number of systems 


8251 


5322 


E [tr(z/'Qw)] 


10371 


20738 


E[tr(zf'Qw)} 


10371 


20738 


Var [ir (z/'Qw)] 


3932 


8123 


Var [ir(z/'Qw)] 


2391 


5301 


MC St. error 


0.499 


0.998 


MC St. error 


0.499 


0.998 


Empirical St. error 


0.502 


1.047 








CPU time per burn-in cycle 


1.97 


3.99 








CPU time per cycle 


1 


2.03 


CPU time per round 


1.03 


1.99 


CPU time per eff. cycle 


2.48 


5.92 


CPU time per system 


35.69 


75.80 


Total CPU time 


39403 


48729 


Total CPU time 


294303 


402872 



imaginary part independently. The chosen example is very well representative of the computational 
demanding tasks typical of the physical sciences. It belongs to the realm of elementary particle 
physics known as Lattice Quantum Chromodynamics briefly described in the following. 

In elementary particle physics the evolution of the particles is described in terms of a field, 
tjj (x), over the space-time (x, is a four components vector, one representing time, the other three 
the spatial coordinates) which governs the annihilation and creation of particles. More specifically 
if) {x)\s an operator whose action on the elements of its domain (a Hilbert space) represents the 
annihilation of a particle at time xoand spatial coordinates {x%, x%, X3) . Conversely, the action 
of the adjoin field ify (x) (the hermitian conjugate in a matrix representation of if> (x)), describes 
the creation of a particle at space-time coordinates x . For a class of particles called 1/2-spin 
Fcrmions, such as the electron, -0is in turn a four components object (each component being an 
operator) called a spinor, an object without a classical analog (it is not a vector since under a 
27rrotation changes its sing). Free evolution of such particles, i.e., without interaction with other 
particles, is governed by the Dirac's equation which in the so called Euclidean representation reads 

LtP= (d IM j» + m)i> = 0, (27) 

where m is the mass of the particle, fi runs over the four spatio-temporal coordinates, denotes 
derivative along the fj, coordinate, summation over repeated indexes is assumed and 7^ is a set of 
four non-hermitian matrices of dimension four satisfying 

where d$ denotes a four by four unit matrix multiplied by the standard Kronecker delta. Alter- 
natively the evolution can be described through the associated Green function of Eq. 127|) , that 
is, the solution of 



L G(x, x') = 5(x — x'), (28) 

where S(x — a;') represents the Dirac's delta. G(x,x')is known also as a propagator of the field 
since given an initial configuration for the field ip(x°), the field at any other space-time event is 
obtained by 

*) = / G („>(*°M*°, 

that is, G(x, 2; Represents the propagation of a Dirac's delta signal from the event x° to the event 
x. In probabilistic terms, it gives the probability of finding a particle at the space-time event x 
given that it was at the event x . Propagators are of paramount importance in quantum field 
theory. 
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In general the evolution is far more complicated than that described by (|27[1 since interactions 
among several fields will be effective. Thus, electrons, having electric charge will interact between 
each other through the electromagnetic field (a gauge field in the jargon of quantum field theory) 
or, in terms of particles, by exchange of photons. Eq. i|27|) must then be supplemented with both 
a free term for the evolution of photons and an interaction term between photons and electrons. 
Within this frame (Quantum Electrodynamics) the interaction is weak enough for a perturvative 
approach in many interesting situations. However, there are other cases in which the coupling 
is so strong that perturvative techniques are useless. The paradigmatic case is the interaction 
of quarks (also fermions, like the electrons) inside nucleons (protons and neutrons), this time by 
exchange of particles called gluons describing an interaction known as Quantum Chromodynamics 
(QCD). In such a case, recourse is made to a discretization of the space-time (a space-time lattice) 
suitable for a numerical solution of the problem. Under such discretization Eq. (|28[1 leads to a 
matrix equation of the form LG = I, so that the propagator is given by L~ x . Interactions make 
L depend on the value of the gluon field at each spacetime event but still the discretized quark 
propagator is given by a matrix inversion. However, an extra average over an ensemble of different 
realizations of the fields is necessary. Altogether leads to an extremely demanding computational 
task and Lattice QCD is responsible of a mayor part of the CPU time consumed in Science (for a 
series of reviews in the computational aspects of Lattice QCD see [H] and in particular JSj)- Here 
we are only interested in testing the CC algorithm so that we restrict ourselves to the free case 
described by (|27l ~). Although trivial from the physical point of view, this simple case can be solved 
analytically 0] and therefore the solution can be checked. The final result after certain technical 
details reads (in a compact notation) 0], 



where Latin indexes run over lattice points, Greek indexes over space-time dimensions, if is a 
constant approaching 1 /8 in the continuum limit and 5^ is the standard Kronecker delta. In this 
notation a Latin index, say m, labels a lattice point while the v spatio-temporal coordinate of 
such a point is denoted by m v . Thus, in each lattice point a four dimensional matrix acting on 
the spinor components is defined. 

Explicitly in terms of both the spatio-temporal coordinates (Latin indexes) and spinor com- 
ponents (Greek indexes) L reads 

L /i u ab c d mnl k & [iv^a m&bn&cl&dk ~\~ 



In the next term in the sum 7 1 changes to 7 2 , a interchanges with band m with n and so for until 
the indexes are exhausted. The 7's matrices are defined as 




(29) 



K {S bn 5 c iS dk [<5 Q+ i m (J M „ +7^,) + 5 a -i 



m 



+•••}■ (30) 




and 




where the cr's, the 1 and the must be understood as two by two matrices. The er's are the show 
called Pauli matrices given by 
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with i representing the imaginary unit. Eq. 1)30(1 is useful to implement L in a program but to 
apply the CC algorithm as described in in section 2.4, a one to one mapping of the ten indexes of 
L to only two is also needed. We use the following one usual in QCD 

mf = l + a + N 1 {b + N 2 (c + N 3 (d + N a ^))) 
n' = l + m + N 1 (n + N 2 (l + N 3 (k + N y))), 

Here Ni, N%, N 3 and No are the number of discrete points in each spatio-temporal coordinate of 
the lattice (as above, corresponds to the time coordinate). The rank of the corresponding matrix 
is then 4N0N1N2N3 while expression (|30l) yields fourteen non zero elements per row so that the 
total number of non zero elements is 56 NqNi N 2 N 3 . 

Again two cases were tested. The parameters are given in table [21 and corresponds to two 
different spatio-temporal lattice sizes in both cases with the same number of discrete values for 
the four coordinates. The value of K was arbitrarily chosen to 0.1. 



Table 3: Topics of the Dirac's free fcrmion matrix in two cases with different lattice sizes 



Case 


1 


2 


Ni, N 2 , N 3 , N 


18 


20 


Rank of C 


419904 


640000 


Nonzero elements 


5878656 


8960000 


K 


0.1 


0.1 



Table 4: Results of CC and SE on Dirac's matrices (first case). 



CC method 


SE method 


N 


18 


Rounds per system 


55.03 


M-N 


10832 


Total rounds 


219167 


Eff. length size 


10805.0 


Number of systems 


3983 


E 


tr(ztQw) 


exact value 


413007.84+0i 






E 


4r(ztQw) 




413004.47-1. 87i 


E ir(ztQw) 


413005.08-l.98i 


Var ir(z^Qw) 


184130.43 


Var tr(z T Qw) 


67827.29 


MC St. error 


4.128 


MC St. error 


4.130 


CPU time per burn-in cycle 


1.43 






CPU time per cycle 


0.4746 


CPU time per round 


0.40 


CPU time per eff. cycle 


1.00 


CPU time per system 


21.89 


Total CPU time 


10859 


Total CPU time 


87167 
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Table 5: Results of CC and SE on Dirac's matrices (second case). 



CC method 


SE method 


N 


18 


Rounds per system 


56.08 


M-N 


6782 


Total rounds 


146379 


Eff. length size 


6848.8 


Number of systems 


2610 


E 


tr(z^Qw) 


exact value 


629489.14+0i 






E 


tr(z^Qw) 




629480. 89-0.53i 


E tr(zfQw) 


629482.78-0. li 


Var fr(z^Qw) 


270373.73 


Var ir(z^Qw) 


103395.82 


MC St. error 


6.283 


MC St. error 


6.294 


CPU time per burn-in cycle 


2.26 






CPU time per cycle 


1.54 


CPU time per round 


0.61 


CPU time per eff. cycle 


1.53 


CPU time per system 


33.99 


Total CPU time 


10503 


Total CPU time 


88721 



From inspection of tables 01 and it is clear the advantage of the CC algorithm whose perfor- 
mance again is around eight times higher than that of the SE. As explained above, this time it is 
possible to check the results against exact deterministic calculations. As expected they coincide 
with the estimated values within the imposed tolerance. 

4 Discussion and Conclusion 

Once the far superior efficiency of the CC algorithm has been demonstrated some comments about 
its applicability seems in order but first a note about the GS . Notice that the GS as implemented 
in J2J is just the particular case of CC corresponding to hermitian matrices since in this case 
z^ = w. Therefore its convergence is determined by sp(T) < 1 and the no singularity of D. These 
conditions do not implies C being positive definite. For instance, in the trivial case of C being 
diagonal, sp(T) — 0, and Eq. (0) reduces to z| — / '^/cU which, although implies imaginary 

(k) 

values of z\ ' works perfectly. 

In the general case when the convergence criteria are not satisfied at least two alternatives ex- 
its. One is to rewrite the algorithm with a different partition than that shown here. For instance, 
suppose that D is singular. Then, a possibility is to use a partition with relatively small non singu- 
lar diagonal blocks surrounding the zeros of D, amenable of been inverted deterministically with 
the memory recourses available. In fact, in terms of time efficiency, there will be partitions more 
efficient than the one used here but, obviously, without the appealing implementation simplicity 
of Eqs. © and (0J. As a rule of the thumb, the convergence will be guaranteed for a sufficiently 
"heavy" D. A new partition in the way just described could solve problems with small elements 
in the original D not only with zeros. 

Another simpler possibility is just to reorder the rows and the columns of C trying to locate 
in the diagonal large enough elements. In general, reordering which implies a low time penalty, 
could be advantageous to improve the convergence of the method. 

In summary, we have presented an efficient stochastic algorithm based in correlated Markov 
chains suitable for the inversion of very large matrix whenever the memory recourses are not 
enough for the application of the standard deterministic methods. The efficiency of the algorithm 
has been tested in a couple of numerical examples rendering a dramatic improving of eight times 
faster runs with respect to the best stochastic method known by the authors. The necessary and 
sufficient conditions for the convergence of the algorithm have been also given. 
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